Purpose

This script takes a deep dive into the Sentinel 2 labels for a more rigorous analysis of inconsistent band data and outliers in the filtered label dataset. Here we will determine whether any more label data points should be removed from the training dataset, and whether we can glean anything from the metadata in the outlier dataset that would let us pre-emptively toss out scenes when we go to apply the classification algorithm.

harmonize_version = "2024-04-25"
outlier_version = "2024-04-25"

S2 <- read_rds(paste0("data/labels/harmonized_SEN2_labels_", harmonize_version, ".RDS"))

Check for mismatched band data between the user pull and the re-pull

Just look at the data to see whether the user-pulled data and our re-pull are consistent. Here, the user data are in “BX” format and the re-pull is in “SR_BX” format. These steps assure data quality in case a volunteer didn’t follow the directions explicitly.

pmap(.l = list(user_band = S2_user,
               ee_band = S2_ee,
               data = list(S2),
               mission = list("SENTINEL_2")),
     .f = make_band_comp_plot)
(Nine band comparison plots appear here, one per band.)

There are a handful of fliers here; we’ll compare across the user-pulled and re-pulled data for consistency.

S2_inconsistent <- S2 %>% 
  filter(B2 != SR_B2 | B3 != SR_B3 | 
           B4 != SR_B4 | B5 != SR_B5 | B6 != SR_B6 | B7 != SR_B7 | 
           B11 != SR_B11 | B12 != SR_B12)

S2_inconsistent %>% 
  group_by(class) %>% 
  summarise(n_labels = n()) %>% 
  kable()
| class                  | n_labels |
|:-----------------------|---------:|
| algalBloom             |        2 |
| cloud                  |       53 |
| darkNearShoreSediment  |        3 |
| lightNearShoreSediment |       13 |
| offShoreSediment       |       20 |
| openWater              |       40 |
| other                  |        4 |
| shorelineContamination |        2 |

We should see if there is any common contributor or scene here. There were no data handling differences between the user-pull and our second pull, so we can’t dismiss the cloud differences.

S2_inconsistent %>% 
  group_by(vol_init) %>% 
  summarise(n_dates = length(unique(date)),
            n_labels = n()) %>% 
  kable()
| vol_init | n_dates | n_labels |
|:---------|--------:|---------:|
| ANK      |       7 |       24 |
| BJM      |       9 |       51 |
| JFL      |       2 |        6 |
| LAE      |       1 |        7 |
| LRCP     |       3 |       37 |
| SKS      |       2 |        9 |
| TJB      |       1 |        3 |

The inconsistencies are spread across many dates, so I don’t think there is any special handling necessary here except for removing these from the dataset.

S2_filtered <- S2 %>% 
  anti_join(S2_inconsistent) %>% 
  filter(# drop rows where any re-pulled band value is greater than 1,
         # which isn't a valid value
         if_all(S2_ee,
                ~ . <= 1))

And plot:

(Nine band comparison plots for the filtered data appear here.)

And now let’s look at the data by class:

(Nine plots of the band data by class appear here.)

We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI). Additionally, “algalBloom” is likely a processing issue, and n is too low to use that data, so we’ll drop that class as well. Finally, we’ll drop the user band values since we’re done with comparisons.

S2_for_class_analysis <- S2_filtered %>% 
  filter(!(class %in% c("other", "shorelineContamination", "algalBloom"))) %>% 
  select(-c(B2:B12))
(Nine plots appear here.)

Outlier handling

There are statistical outliers within this dataset, and they may impact the interpretation of any statistical testing we do. Let’s see if we can narrow down when those outliers occur and/or glean anything from the outlier data that may be applicable to the application of the algorithm. Outliers may be a systemic issue (as in the scene itself is an outlier), a user issue (a user may have been a bad actor), or they might just be real. This section asks those questions. The “true outliers” that we dismiss from the dataset will also be used to help aid in interpretation/application of the algorithm across the Landsat stack, so it is important to make notes of any patterns we might see in the outlier dataset.

## [1] "Classes represented in outliers:"
## [1] "darkNearShoreSediment"  "lightNearShoreSediment" "offShoreSediment"      
## [4] "openWater"

Okay, 24 outliers (>1.5*IQR) out of 999, and they are all from non-cloud groups.
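The outlier flagging itself isn't shown in this chunk; below is a minimal base-R sketch of the 1.5*IQR rule referenced above. The function name is illustrative, and in the actual script the rule is presumably applied per band (and likely per class):

```r
# Flag values outside the Tukey fences: below Q1 - k*IQR or above
# Q3 + k*IQR, with the conventional k = 1.5 used above.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# e.g., flag_iqr_outliers(c(1:10, 100)) flags only the 100
```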

How many of these outliers are in specific scenes?

S2_out_date <- outliers %>% 
  group_by(date, vol_init) %>% 
  summarize(n_out = n())
S2_date <- S2_for_class_analysis %>% 
  filter(class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_tot = n())
S2_out_date <- left_join(S2_out_date, S2_date) %>% 
  mutate(percent_outlier = n_out/n_tot*100) %>% 
  arrange(-percent_outlier)
S2_out_date %>% 
  kable()
| date       | vol_init | n_out | n_tot | percent_outlier |
|:-----------|:---------|------:|------:|----------------:|
| 2022-06-23 | BJM      |     6 |    24 |       25.000000 |
| 2023-04-11 | BJM      |     6 |    30 |       20.000000 |
| 2019-06-06 | LRCP     |     9 |   148 |        6.081081 |
| 2020-09-28 | BJM      |     1 |    32 |        3.125000 |
| 2021-08-29 | ANK      |     1 |    62 |        1.612903 |
| 2019-05-12 | LRCP     |     1 |    86 |        1.162791 |

There are two images with higher outlier percentages; let’s look at them quickly:

2022-06-23

There’s some weird stuff happening here - while this does look like some near shore sediment, I think there are also some aerosol issues with this scene. If you zoom in on other waterbodies on this date, they have a similar grey-brown sheen to them, and if you zoom in on this image, it doesn’t look… natural?

2023-04-11

This is a really cool looking scene. Lots of things going on here: 1) the end of ice break up 2) super hazy/cirrus clouds 3) snow on the ground might be affecting the SR processes. We should probably ditch this scene, but we’ll do that at the end.

Tile-level metadata

Let’s look at the tile-level metadata for the scenes with the highest outlier percentages:

S2_out_date %>% 
  filter(percent_outlier >= 20) %>% 
  select(date, vol_init) %>% 
  left_join(., S2) %>% 
  select(date, vol_init, GENERAL_QUALITY:RADIOMETRIC_QUALITY, 
         SNOW_ICE_PERCENTAGE_mean,
         SNOW_ICE_PERCENTAGE_max,
         SNOW_ICE_PERCENTAGE_min) %>% 
  distinct() %>% 
  kable()
| date | vol_init | GENERAL_QUALITY | GEOMETRIC_QUALITY | RADIOMETRIC_QUALITY | SNOW_ICE_PERCENTAGE_mean | SNOW_ICE_PERCENTAGE_max | SNOW_ICE_PERCENTAGE_min |
|:--|:--|:--|:--|:--|--:|--:|--:|
| 2022-06-23 | BJM | PASSED | PASSED | PASSED | 0.000000 | 0.00000 | 0.000000 |
| 2023-04-11 | BJM | PASSED | PASSED | PASSED | 9.856787 | 28.13553 | 0.000361 |

Some detection of snow/ice in the April image, but since these are aggregated values across multiple tiles, it’s hard to say what to glean from this.

Clouds

How many of these outliers have tile-level cloud metadata with high values?

The outlier dataset contains 7 labels (29.2%) where the mean pixel cloud cover was > 60% and 6 (25%) where the mean pixel cirrus cover was > 20%. The filtered dataset contains 97 (9.7%) where the mean pixel cloud cover was > 60% and 47 (4.7%) where the mean pixel cirrus cover was > 20%. This is a little more helpful than in the Landsat images: higher pixel cloud cover and cirrus are more common among outliers than in the filtered dataset.
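The tallies above could be computed with a small helper; the 60% cloud and 20% cirrus thresholds come from the text, while the function name and inputs (the per-label tile-metadata columns) are illustrative:

```r
# Percent of labels whose tile-level mean cloud / cirrus cover exceeds
# the thresholds used in the comparison above.
cloud_meta_summary <- function(cloud_mean, cirrus_mean) {
  c(pct_cloudy_gt_60 = 100 * mean(cloud_mean > 60, na.rm = TRUE),
    pct_cirrus_gt_20 = 100 * mean(cirrus_mean > 20, na.rm = TRUE))
}
```

Called once on the outlier rows and once on the filtered rows, this yields the pairs of percentages quoted above.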

QA Pixels

Do any of the labels have QA pixel indications of opaque cloud or cirrus?

S2_for_class_analysis %>% 
  mutate(QA = case_when(cirrus == 1 | cirrus_scl == 1 ~ "cirrus",
                        hi_prob_cloud == 1 | med_prob_cloud == 1 ~ "cloud",
                        snow_ice == 1 ~ "snow ice",
                        opaque == 1 ~ "opaque cloud",
                        cloud_shadow == 1 ~ "cloud shadow",
                        dark_pixel == 1 ~ "dark pix",
                        TRUE ~ "clear")) %>% 
  group_by(QA) %>% 
  filter(class != "cloud") %>% 
  summarize(n_tot = n()) %>% 
  kable()
| QA           | n_tot |
|:-------------|------:|
| cirrus       |    14 |
| clear        |   494 |
| opaque cloud |    10 |

Let’s look at the cirrus and opaque cloud group to see if there is anything egregious:

S2_tot <- S2_for_class_analysis %>% 
  group_by(date, vol_init) %>% 
  summarise(n_tot_labels = n())
S2_for_class_analysis %>% 
  filter(cirrus == 1 | cirrus_scl == 1 | opaque == 1, 
         class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_qa_flag = n()) %>% 
  left_join(S2_tot) %>%
  mutate(perc_qa_flag = round(n_qa_flag/n_tot_labels*100, 1)) %>% 
  arrange(-perc_qa_flag) %>% 
  kable()
| date       | vol_init | n_qa_flag | n_tot_labels | perc_qa_flag |
|:-----------|:---------|----------:|-------------:|-------------:|
| 2020-10-26 | BJM      |         9 |           37 |         24.3 |
| 2023-04-11 | BJM      |        12 |           71 |         16.9 |
| 2020-09-28 | BJM      |         2 |          103 |          1.9 |
| 2019-05-12 | LRCP     |         1 |          109 |          0.9 |

The top two images here are the same as in the outlier analysis. I think we’ll trust the QA bits to help with all this.

Aerosol band

Sentinel 2 doesn’t have an aerosol QA flag, but SR_B1 is the aerosol band, so we can examine its distribution instead. Let’s look at the distribution across all data:

S2_for_class_analysis %>% 
  ggplot(., aes(x = SR_B1)) +
  geom_histogram(binwidth = 0.01) +
  facet_grid(class ~ .) +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, max(S2_for_class_analysis$SR_B1), 0.05))

B1_summary <- S2_for_class_analysis %>% 
  group_by(class) %>% 
  summarise(fifth_B1 = quantile(SR_B1, 0.05),
            mean_B1 = mean(SR_B1),
            ninetyfifth_B1 = quantile(SR_B1, 0.95))

kable(B1_summary)
| class                  | fifth_B1 |   mean_B1 | ninetyfifth_B1 |
|:-----------------------|---------:|----------:|---------------:|
| cloud                  | 0.168800 | 0.5004647 |       0.860100 |
| darkNearShoreSediment  | 0.011015 | 0.0320797 |       0.048465 |
| lightNearShoreSediment | 0.033250 | 0.0546151 |       0.088225 |
| offShoreSediment       | 0.027685 | 0.0422233 |       0.058135 |
| openWater              | 0.017870 | 0.0293747 |       0.044230 |

Let’s see how many of the outliers have Rrs values higher than the 95th percentile in their class:

outliers %>% 
  left_join(B1_summary) %>% 
  filter(SR_B1 > ninetyfifth_B1, class != "cloud") %>% 
  nrow()
## [1] 13

This is about 50% of the outliers (13 of 24), so it’s a pretty decent QA measure.

Training dataset implications

For the purposes of training data, we are going to throw out any labels that have QA flags associated with them:

S2_training_labels <- S2_for_class_analysis %>% 
  filter(if_all(c("cirrus", "cirrus_scl", "cloud_shadow", 
                    "dark_pixel", "hi_prob_cloud", "med_prob_cloud",
                    "opaque", "snow_ice"),
                  ~ . == 0) |
           class == "cloud") %>% 
  filter(!(date %in% c("2022-06-23", "2023-04-11")))

Let’s now see how many of these have high percentile SR_B1:

S2_training_labels %>% 
  left_join(B1_summary) %>% 
  filter(SR_B1 > ninetyfifth_B1, class != "cloud") %>% 
  nrow()
## [1] 14

And make sure none of them are concentrated in a specific scene:

S2_training_labels %>% 
  left_join(B1_summary) %>% 
  filter(SR_B1 > ninetyfifth_B1, class != "cloud") %>% 
  group_by(date, vol_init) %>% 
  summarise(n_high_aero = n())
## # A tibble: 6 × 3
## # Groups:   date [6]
##   date       vol_init n_high_aero
##   <date>     <chr>          <int>
## 1 2019-05-12 LRCP               2
## 2 2019-06-06 LRCP               3
## 3 2020-09-28 BJM                1
## 4 2020-10-26 BJM                2
## 5 2020-11-30 BJM                1
## 6 2022-11-30 BJM                5

This seems totally reasonable.

Testing for inter-class differences

We do want to have an idea of how different the classes are with regard to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis we are going to analyze the class differences by band.

Kruskal-Wallis assumptions:

  1. Data are non-normal or have a skewed distribution
  2. There must be at least two independent groups
  3. Data have a similar distribution across groups
  4. Data are independent; the groups shouldn’t have a relationship to one another
  5. Each group should contain at least 5 observations

ANOVA assumptions:

  1. Data are normally distributed
  2. Data have equal variances; if variances are unequal, perform Kruskal-Wallis
  3. Data are independent
  4. Variance across groups is similar

We can’t entirely assert sample independence, and we know that variance and distribution are different for “cloud” labels, but those data are also visibly different from the other classes.

In order to systematically test for differences between classes and be able to interpret the data, we will need to know some things about our data:

  1. Are the data normally distributed (Shapiro-Wilk)?
  2. Are there outliers that may impact interpretation?
  3. If the data are non-normal, perform the Kruskal-Wallis test; otherwise, use ANOVA if variances are equal, falling back to Kruskal-Wallis if not
  4. If the null is rejected (i.e., there is a difference in at least one class), perform a post-hoc test for pairwise comparisons (Dunn test in both cases)
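The decision tree above can be sketched for a single band in base R. One substitution to note: base R has no Dunn test, so `pairwise.wilcox.test` stands in for the post-hoc step here (the script itself presumably uses a Dunn implementation such as the one in rstatix); the function name and `alpha` default are illustrative:

```r
test_band <- function(values, classes, alpha = 0.05) {
  classes <- factor(classes)
  if (shapiro.test(values)$p.value < alpha) {
    # Non-normal: Kruskal-Wallis, then pairwise post-hoc if significant
    kw <- kruskal.test(values, classes)
    if (kw$p.value < alpha) {
      return(pairwise.wilcox.test(values, classes, p.adjust.method = "holm"))
    }
    return(kw)
  }
  # Normal: ANOVA (the equal-variance check is omitted here for brevity)
  summary(aov(values ~ classes))
}
```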

With this workflow, most classes are statistically different; below are the cases where the pairwise comparisons were not deemed statistically significant:

## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
| band   | group1                 | group2                 |  n1 |  n2 |  statistic |         p |     p.adj | p.adj.signif |
|:-------|:-----------------------|:-----------------------|----:|----:|-----------:|----------:|----------:|:-------------|
| SR_B2  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 |  2.7190570 | 0.0065468 | 0.0654683 | ns |
| SR_B2  | darkNearShoreSediment  | openWater              |  64 | 154 | -1.6152259 | 0.1062618 | 1.0000000 | ns |
| SR_B2  | lightNearShoreSediment | offShoreSediment       | 102 | 117 | -2.6802513 | 0.0073567 | 0.0735669 | ns |
| SR_B3  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 |  1.2549300 | 0.2095042 | 1.0000000 | ns |
| SR_B4  | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  2.4622904 | 0.0138053 | 0.1380529 | ns |
| SR_B4  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.6410142 | 0.1007945 | 1.0000000 | ns |
| SR_B5  | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  2.0770150 | 0.0378002 | 0.3780018 | ns |
| SR_B5  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -2.1130300 | 0.0345982 | 0.3459820 | ns |
| SR_B6  | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  1.4614875 | 0.1438817 | 1.0000000 | ns |
| SR_B6  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.9932388 | 0.0462353 | 0.4623530 | ns |
| SR_B7  | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  1.3916638 | 0.1640242 | 1.0000000 | ns |
| SR_B7  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.9394492 | 0.0524467 | 0.5244667 | ns |
| SR_B8  | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  1.3596677 | 0.1739351 | 1.0000000 | ns |
| SR_B8  | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.9968907 | 0.0458371 | 0.4583706 | ns |
| SR_B8A | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  1.2685325 | 0.2046079 | 1.0000000 | ns |
| SR_B8A | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.6112531 | 0.1071246 | 1.0000000 | ns |
| SR_B11 | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  0.5292842 | 0.5966083 | 1.0000000 | ns |
| SR_B11 | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.6296461 | 0.1031763 | 1.0000000 | ns |
| SR_B11 | lightNearShoreSediment | offShoreSediment       | 102 | 117 | -2.4933917 | 0.0126529 | 0.1265292 | ns |
| SR_B11 | offShoreSediment       | openWater              | 117 | 154 | -2.4124536 | 0.0158456 | 0.1584555 | ns |
| SR_B12 | darkNearShoreSediment  | lightNearShoreSediment |  64 | 102 |  0.4650300 | 0.6419100 | 1.0000000 | ns |
| SR_B12 | darkNearShoreSediment  | offShoreSediment       |  64 | 117 | -1.4404613 | 0.1497369 | 1.0000000 | ns |
| SR_B12 | lightNearShoreSediment | offShoreSediment       | 102 | 117 | -2.2006275 | 0.0277624 | 0.2776241 | ns |
| SR_B12 | offShoreSediment       | openWater              | 117 | 154 | -1.8261056 | 0.0678344 | 0.6783435 | ns |

Eep, this isn’t great, but the pairwise confusion seems concentrated in dark near shore sediment.

Let’s look at the boxplots of the non-cloud classes:

S2_training_labels_no_clouds <- S2_training_labels %>% 
  filter(class != "cloud")
pmap(.l = list(data = list(S2_training_labels_no_clouds),
               data_name = list("SENTINEL_2"),
               band = S2_ee_all),
     .f = make_class_comp_plot)
(Ten class comparison boxplots appear here, one per band.)

This actually makes me feel a little better. I think there might be enough here for ML to differentiate between classes.

There are outliers in SR_B1; 0 in SR_B2, 0 in SR_B3, 1 in SR_B4, 1 in SR_B5, 1 in SR_B6, 1 in SR_B7, 1 in SR_B8, 0 in SR_B8A, 10 in SR_B11, and 13 in SR_B12.

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

Let’s zoom in on the sediment classes.

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment

Aggregating sediment classes and performing statistical tests

As a backup, we should consider using aggregated sediment classes, where any sediment label is treated as a general class of “sediment”. Let’s run the same process here to test for class significance.
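The aggregation itself could be as simple as collapsing anything ending in "Sediment"; the helper below is a guess at how the `agg_class` column used later was built:

```r
# Collapse the three sediment classes into one generic "sediment" class;
# other classes (openWater, cloud) pass through unchanged.
aggregate_sediment <- function(class) {
  ifelse(grepl("Sediment$", class), "sediment", class)
}
```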

## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## [1] "Data are non-normal, performing Kruskal-Wallis"
## [1] "Kruskal-Wallis detected significant differences between groups,\n            running post-hoc test of multiple pairwise comparisons."
## # A tibble: 0 × 9
## # ℹ 9 variables: band <chr>, group1 <chr>, group2 <chr>, n1 <int>, n2 <int>,
## #   statistic <dbl>, p <dbl>, p.adj <dbl>, p.adj.signif <chr>

And let’s look at the scatter plots here:

And if we drop the cloud:

S2_training_labels %>% 
  filter(agg_class != "cloud") %>% 
  ggpairs(., columns = S2_ee_all, aes(color = agg_class)) + 
  scale_color_colorblind() +
  scale_fill_colorblind() +
  theme_few()

Export the training labels

Things to note for Sentinel 2:

  • Any QA-flagged pixel should be dismissed before applying the algorithm.
  • SR_B1 values greater than 0.04 for open water and dark near shore sediment, 0.05 for offshore sediment, and 0.08 for light near shore sediment may have more uncertainty associated with their labels.
  • Bright cloud cover and snow may impact Rrs within the waterbody, leading to outliers. We will need to be cautious applying the algorithm when snow is on the ground; we might consider filtering out tiles with high (>10%) snow/ice cover.
  • Higher concentrations of cirrus clouds (via tile metadata) and cloud cover may merit additional uncertainty.

write_rds(S2_training_labels, paste0("data/labels/S2_labels_for_tvt_", outlier_version, ".RDS"))
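The per-class SR_B1 caution thresholds in the notes above could be applied downstream roughly as follows; the threshold vector and helper function are illustrative, not part of the exported script:

```r
# Per-class SR_B1 values above which label uncertainty may be higher
# (thresholds taken from the notes above).
b1_caution <- c(openWater = 0.04, darkNearShoreSediment = 0.04,
                offShoreSediment = 0.05, lightNearShoreSediment = 0.08)

# Vectorized over labels: TRUE where a label's SR_B1 exceeds the
# caution threshold for its class.
flag_high_b1 <- function(class, sr_b1) {
  unname(sr_b1 > b1_caution[class])
}
```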